Airbnb, Inc. is an online marketplace for arranging or offering lodging, primarily homestays, or tourism experiences. The company does not own any of the real estate listings, nor does it host events; it acts as a broker, receiving commissions from each booking.
The purpose of this notebook is to perform an exploratory data analysis of Airbnb listings in New York City for the year 2019.
The data, sourced from Kaggle, contains all the Airbnb listings in New York for the year 2019.
Loading the required libraries, frameworks and data:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import folium
from folium import plugins
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import plotly.express as px
df = pd.read_csv(r'C:\Users\benny\OneDrive\Desktop\Misc\RandC\Projects\NYC Airbnb Price Analysis\AB_NYC_2019.csv')
df.head()
Checking if there are any null values in the data:
df.isnull().sum() #finding null values
Several variables contain null values. These could interfere with our analysis, so we need to either remove them or replace them with a suitable value. Let's replace the null values in 'reviews_per_month' with zero and the null values in 'name', 'host_name' and 'last_review' with 'Not Specified'.
# Replace nulls in one call: zero for reviews_per_month, 'Not Specified' for the text columns
df.fillna({
    'reviews_per_month': 0,
    'name': 'Not Specified',
    'host_name': 'Not Specified',
    'last_review': 'Not Specified',
}, inplace=True)
df.isnull().sum()
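The null-handling step can be tried out on a small frame before running it on the full data. The sketch below is only illustrative: the `toy` frame and its values are made up, and only the column names are taken from the dataset.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names match the dataset.
toy = pd.DataFrame({
    "name": ["Cozy loft", None, "Sunny room"],
    "host_name": ["Ana", "Ben", None],
    "last_review": ["2019-05-01", None, None],
    "reviews_per_month": [1.2, None, None],
})

# One fillna call with a per-column mapping of replacement values.
toy = toy.fillna({
    "reviews_per_month": 0,
    "name": "Not Specified",
    "host_name": "Not Specified",
    "last_review": "Not Specified",
})

# Total number of nulls left across all columns.
remaining_nulls = int(toy.isnull().sum().sum())
print(remaining_nulls)
```

Passing a mapping to `fillna` keeps every replacement rule in one place, which makes it easier to see which columns were touched.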
df.describe()
Let's take a look at how the variables correlate with the price variable.
CM = df[['price','neighbourhood_group','neighbourhood', 'latitude', 'longitude', 'room_type','minimum_nights','number_of_reviews',
'reviews_per_month','calculated_host_listings_count', 'availability_365']]
cor = CM.corr(numeric_only=True) #Calculate the correlation of the numeric variables (text columns are skipped; needs pandas >= 1.5)
cm=sns.heatmap(cor, cmap = 'Greens', square = True) #Plot the correlation as heat map
cm.set_xticklabels(cm.get_xticklabels(),
rotation=45,
horizontalalignment='right')
#not very strong correlations except for number of reviews and reviews per month
None of the variables seems to correlate strongly with the price variable. The only notable correlation, depicted by the darker shade of green, is between 'number_of_reviews' and 'reviews_per_month', and it relates those two variables to each other rather than to price.
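A heat map is easier to interpret alongside the numbers behind it. As a minimal sketch, assuming a made-up `toy` frame (the values are hypothetical, the column names come from the dataset), the correlations with price can be ranked by absolute strength:

```python
import pandas as pd

# Hypothetical values for illustration only.
toy = pd.DataFrame({
    "price": [50, 80, 120, 200, 150],
    "minimum_nights": [1, 2, 3, 4, 5],
    "number_of_reviews": [40, 10, 25, 5, 30],
    "room_type": ["Private room", "Private room", "Entire home/apt",
                  "Entire home/apt", "Shared room"],
})

# numeric_only=True skips text columns such as room_type (pandas >= 1.5).
price_corr = (
    toy.corr(numeric_only=True)["price"]
    .drop("price")                           # drop the self-correlation of 1.0
    .sort_values(key=abs, ascending=False)   # strongest relationship first
)
print(price_corr)
```

Sorting by absolute value puts strong negative correlations next to strong positive ones, which a single color scale can make hard to compare.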
In this section of the notebook, we will explore the data through various visualizations.
fig1, ax1 = plt.subplots(figsize=(5,5))
# Create a pie chart
# Compute the counts once so that the wedge sizes and labels share the same order
borough_counts = df['neighbourhood_group'].value_counts()
ax1.pie(
    borough_counts,
    labels=borough_counts.index,
    pctdistance=0.85,
    explode=(0.05, 0.05, 0.05, 0.05, 0.05),
    colors=['orange', 'coral', 'gold', 'yellow', 'pink'],
    # show each share as a percentage with one decimal place
    autopct='%1.1f%%',
)
#for donut chart
#draw circle
centre_circle = plt.Circle((0,0),0.70,fc='white')
fig2 = plt.gcf()
fig2.gca().add_artist(centre_circle)
# Equal aspect ratio ensures that the pie is drawn as a circle
ax1.axis('equal')
plt.tight_layout()
plt.title('Borough-wise listings distribution')
plt.show()
#manhattan and brooklyn have most airbnbs
From the donut chart, Manhattan and Brooklyn account for the largest shares of Airbnb listings, followed by Queens; the Bronx and Staten Island have the smallest shares.
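The shares shown in the donut chart can also be read off numerically. The sketch below uses a made-up `toy` series (hypothetical values) to show one way of doing that, and why the labels are best taken from the same `value_counts` result as the wedge sizes.

```python
import pandas as pd

# Hypothetical borough column; the real data has five boroughs.
toy = pd.Series(
    ["Brooklyn", "Manhattan", "Queens", "Manhattan", "Brooklyn",
     "Manhattan", "Bronx", "Queens", "Manhattan", "Brooklyn"],
    name="neighbourhood_group",
)

counts = toy.value_counts()                               # sorted by count, descending
shares = (toy.value_counts(normalize=True) * 100).round(1)  # percentage per borough
labels = counts.index.tolist()                            # same order as the counts

# unique() follows order of first appearance, which can differ from the count order.
appearance_order = toy.unique().tolist()

print(shares)
print(labels, appearance_order)
```

Here `labels` and `appearance_order` disagree, so pairing `value_counts()` with `unique()` would attach the wrong names to the wedges.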
Let's visualize the locations of the listings on a map of New York City.
plt.figure(figsize=(15,6))
sns.scatterplot(x='longitude', y='latitude', hue='neighbourhood_group', data=df) # keyword arguments are required by recent seaborn versions
The 3D visualizations below show the price distribution of the listings by borough and room type. Staten Island seems to contain the least expensive listings as compared to all the other boroughs.
fig = px.scatter_3d(df, x='latitude', y='longitude', z='price',
color='neighbourhood_group')
fig.show()
fig = px.scatter_3d(df, x='room_type', y='neighbourhood_group', z='price',
color='room_type')
fig.show()
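A 3D scatter gives a visual impression of price by borough, and a grouped summary can back it up with numbers. This is a minimal sketch on a made-up `toy` frame (hypothetical prices); the median is used here as an assumption, because it is less sensitive to a few very expensive listings than the mean.

```python
import pandas as pd

# Hypothetical prices for illustration only.
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Brooklyn",
                            "Brooklyn", "Staten Island", "Staten Island"],
    "price": [200, 150, 100, 120, 60, 80],
})

# Median, mean and listing count per borough, cheapest borough first.
summary = (
    toy.groupby("neighbourhood_group")["price"]
    .agg(["median", "mean", "count"])
    .sort_values("median")
)
print(summary)
```

Including the count alongside the price statistics shows how many listings each figure is based on.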
Top 5 neighbourhoods with the least expensive listings:
Inexpensive_neighbourhoods = df.groupby('neighbourhood').agg({'price': 'mean'}).sort_values('price').reset_index()
plt.figure(figsize=(12,6))
sns.barplot(y="neighbourhood", x="price", palette = 'Greens', data=Inexpensive_neighbourhoods.head(5))
plt.show()
Top 5 neighbourhoods with the most expensive listings:
plt.figure(figsize=(12,6))
sns.barplot(y="neighbourhood", x="price", palette = 'Reds', data=Inexpensive_neighbourhoods.tail(5))
plt.show()
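The two rankings above can also be obtained directly from the grouped means. The sketch below is illustrative only: the `toy` frame, its neighbourhood names "A" to "D" and its prices are all made up.

```python
import pandas as pd

# Hypothetical neighbourhoods and prices.
toy = pd.DataFrame({
    "neighbourhood": ["A", "A", "B", "B", "C", "D"],
    "price": [100, 200, 50, 70, 300, 90],
})

# Mean price per neighbourhood.
mean_price = toy.groupby("neighbourhood")["price"].mean()

cheapest = mean_price.nsmallest(2)   # lowest mean prices, ascending
priciest = mean_price.nlargest(2)    # highest mean prices, descending

print(cheapest)
print(priciest)
```

Note that a mean based on only one or two listings (such as "C" and "D" here) can be dominated by a single listing, so it may be worth checking the listing count per neighbourhood before drawing conclusions.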
Plotting the listings priced under 500 on a scatter plot of their coordinates, colored by price:
sub = df[df.price < 500]  # keep only listings priced under 500
# DataFrame.plot creates its own figure, so no separate plt.figure call is needed
scat = sub.plot(kind='scatter', x='longitude', y='latitude', label='price', cmap='Purples', c='price', colorbar=True, figsize=(10,10))
scat.legend()
Brooklyn has the most private room listings, while Manhattan has the most entire home/apartment listings. Shared rooms are uncommon in every borough.
plt.figure(figsize=(10,6))
sns.countplot(x = 'room_type',hue = 'neighbourhood_group', palette = 'RdPu', data = df)
plt.title('Room types vs Boroughs')
plt.show()
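The counts behind the grouped bar chart can be tabulated with a cross-tabulation. This is a minimal sketch on a made-up `toy` frame (hypothetical rows; the column names and category labels follow the dataset).

```python
import pandas as pd

# Hypothetical listings for illustration only.
toy = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Brooklyn", "Manhattan",
                            "Manhattan", "Manhattan", "Queens"],
    "room_type": ["Private room", "Private room", "Entire home/apt",
                  "Entire home/apt", "Private room", "Shared room"],
})

# Rows are room types, columns are boroughs, cells are listing counts.
table = pd.crosstab(toy["room_type"], toy["neighbourhood_group"])

# Borough with the most listings for each room type.
top_borough = table.idxmax(axis=1)

print(table)
print(top_borough)
```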
Most popular rooms, based on the number of reviews:
most_popular=df.sort_values(by=['number_of_reviews'],ascending=False).head(100)
most_popular.head()
print('Most Popular Rooms')
# Use the name m rather than map to avoid shadowing Python's built-in map()
m = folium.Map(location=[40.73, -73.93])
map_rooms = plugins.MarkerCluster().add_to(m)
# Add one marker per listing, with the listing name as the popup text
for lat, lon, label in zip(most_popular.latitude, most_popular.longitude, most_popular.name):
    folium.Marker(location=[lat, lon], icon=folium.Icon(icon='fire'), popup=label).add_to(map_rooms)
m
Lastly, let's create a word cloud to visualize the most commonly used words in the listing names:
#word cloud
text = " ".join(str(each) for each in df.name)
# Create and generate a word cloud image:
wc = WordCloud(max_words=50, background_color="white").generate(text)
plt.figure(figsize=(15,10))
# Display the generated image:
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()
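A word cloud shows relative frequency by font size; the underlying counts can be checked with a plain frequency table. The sketch below uses only the standard library. The `names` list and the small `stopwords` set are made up for illustration and are not the stop-word list the word cloud library uses.

```python
from collections import Counter
import re

# Hypothetical listing names.
names = ["Cozy room in Brooklyn", "Sunny private room", "Cozy studio near park"]

# Tiny illustrative stop-word set (an assumption, not the library's list).
stopwords = {"in", "near", "the", "a"}

# Lowercase each name, split it into alphabetic words, and drop stop words.
words = [
    word
    for name in names
    for word in re.findall(r"[a-z]+", name.lower())
    if word not in stopwords
]

# The two most frequent words with their counts.
top_words = Counter(words).most_common(2)
print(top_words)
```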
This exploratory data analysis of the Airbnb listings in New York City for the year 2019 surfaced several patterns in the data: Manhattan and Brooklyn account for most of the listings, private rooms and entire homes/apartments dominate while shared rooms are uncommon, and average prices differ across boroughs and neighbourhoods.